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PACS 75. lO.Nr - Spin-glass and other random models 

Abstract. - We introduce a new distributed algorithm for aligning graphs or finding substructures 
within a given graph. It is based on the cavity method and is used to study the maximum-clique 
and the graph-alignment problems in random graphs. The algorithm allows to analyze large 
graphs and may find applications in fields such as computational biology. As a proof of concept 
we use our algorithm to align the similarity graphs of two interacting protein families involved in 
bacterial signal transduction, and to predict actually interacting protein partners between these 
families. 



Over the last decade, the use of graphs for the descrip- 
tion of relations between components of complex systems 
has become increasingly popular [I]. However, most part 
of the current literature concentrates on (computationally 
accessible) local characteristics like node degrees, whereas 
the full exploitation of more global properties of large net- 
works remains frequently elusive due to their inherent al- 
gorithmic complexity. Most often studying global prop- 

' erties requires solving NP-hard problems or even harder 
problems if some form of uncertainty or lack of information 
is included in the definition of the problem. In both cases 
heuristic algorithms need to be developed. Specific exam- 
ples, covered by this article, include the comparison of two 
different networks, the so-called graph- alignment problem 

' (GA) [IHlj, and the sub-graph isomorphism (SGI), as a 
particular case of which we consider the widely studied 
maximum-clique problem jlHZ]- 

Recently, there has been a lot of interest in distributed 
algorithms to deal with optimization problems over net- 
works. In the context of statistical physics a new genera- 
tion of algorithms has been developed (e.g. [5115] ) that have 
shown promising performance on several applications (for 
a review, see [10]). These techniques are based on the so 
called cavity method and are known as message-passing 
(MP) algorithms. They are fully distributed and easy to 
run on parallel machines. A recent result in this frame- 
work is an algorithm for finding a connected sub-graph 
of a given graph which optimizes a given factorized cost 
function [TT] . 

Here we aim at making a step forward by introducing 



new techniques for SGI and GA. We develop two alter- 
native MP strategies and test their performance on three 
sample problems. The first two are well-defined theoretical 
benchmarks, where our results can be compared to rigor- 
ous bounds: (i) the maximum-clique problem in random 
graphs for the SGI problem; and (ii) the alignment of two 
random graphs of controlled similarity. The third sample 
problem is thought as a proof-of-concept application in 
computational biology: We study (iii) the alignment of the 
similarity networks of two interacting protein-domain fam- 
ilies involved in bacterial signal transduction, to identify 
actual signaling pathways. This case, involving large net- 
works of >2500 nodes, exploits co-evolutionary processes 
between interacting proteins to identify interaction part- 
ners [12j . 

The model — Both problems, SGI and GA, can be 
put into the common framework of matching two graphs 
of possibly different size. Let G = {V, E, w) and G' — 
{V',E',w') be two weighted graphs with nodes V,V', 
edges E,E' and edge weights w,w'. In the applications 
shown in this letter, weights are non-negative, but this 
is not a necessary condition for the applicability of the 
message-passing algorithms. In the case of unweighted 
graphs, we assume w and w' to describe the adjacency 
matrices, i.e. weights are one if an edge is present be- 
tween two vertices, and zero else. Further more we denote 
the node number by A^ = \V\ {N' = \V'\), and the edge 
number by M = \E\ (A/' = |£;'|). Neighbors of a node i 
are assembled in di. To facilitate notation, primed quan- 
tities (in particular node indices) will always refer to G' . 
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Without loss of generality, we assume N < N'. 

The problem is now to find an injective mapping n : 
V ^ V between the nodes of G and G' . This mapping 
minimize the cost function (Hamiltonian) 



(1) 



Note that in the case of unweighted graphs the first term 
counts the number of overlapping links in the graphs G 
and G' . The cw denote similarities between nodes of the 
two graphs, i.e. they provide a local bias for the mapping tt 
which, in physical terms, represents a local external field. 
Computationally this problem is very hard: The complex- 
ity of a simple enumeration is 0{N'^), i.e. it is growing 
more than exponentially for growing N and N'. It is a 
special case of the so-called quadratic assignment problem, 
which would contain any cost e(i, j, tt^, ttj) in the first sum 
of the Hamiltonian. The generalization of our algorithm 
to this case is straight-forward. A major problem is to im- 
plement the injectivity constraint: for i ^ j also tt^ ^ iTj 
has to hold. We introduce two strategies to treat this con- 
straint within approximate algorithmic approaches based 
on message passing: 

(i) We directly introduce this constraint for each pair 
of nodes in G, using a complete graph of these vertices, 
where each link carries the constraint. The Boltzmann 
distribution (at finite formal temperature 1//3) reads 



-/3W(7r) 



(2) 



iJeV:i<j 



The application of message passing to this formulation of 
the problem requires the exchange of vectorial messages of 
dimension N' along all links of the complete graph of con- 
straints, so the complexity becomes 0{N^N'), cf. Eq. ^ 
[5]) below. 

(ii) We relax the constraint, i.e. we consider arbitrary 
mappings tt : V ^ V and then we couple a chemical 
potential p to the number TV^r = |7r(y)| of images of tt: 



P(2)(7r) 



-l3-H(Tr}+l3pN„ 



(3) 



To be usable in a message-passing approach, we further 
express the image number as N-^ = X^i'ev Xi'{''^)i '^itti 
Xi'(7r) = 1 if there exists an i G F having i' — tt^ as it's 
image, and xv (''') = ^Ise. For sufficiently large but finite 
values of the ground states have — N, and injectivity 
is restored. This leads to a slightly more involved message- 
passing algorithm (cf. below) whose time complexity goes 
down to 0{{N + M)N'); this formulation is favorable in 
particular for large sparse graphs G. 

Message-passing algorithms — An exact treatment of 
the two approaches gives equivalent results for /3 oo, 
but it is infeasible for large N due to the super-exponential 
time complexity. Here we develop two heuristic algorithms 
using MP. 

(i) The first algorithm is a straight forward applica- 
tion of belief propagation (BP) to the problem defined in 



Eq. ©. Messages are exchanged between any two nodes 
in V, and they are of dimension N': 



(4) 



Messages are standard cavity probabilities and biases [10] , 
with Pi^j{TTi) being the marginal probability of node i in 
the cavity graph which is constructed by removing j from 
the node set V (the cavity graph is thus a complete graph 
of — 1 nodes). Message Qi^j{TTj) describes the bias in- 
duced by node i on the mapping of node j, including both 
the Boltzmann factor of the weights of aligned edges (i, j) 
and (TTi,iTj) and the injectivity constraint (1 — Jjri.Trj )• The 
BP equations can be solved iteratively, and the marginal 
probability that node i V chooses tt^ e V' as its image 
is given by 

p}'\n,)o,ef'-"^l[Qk^,i7:,) . (5) 

(a) Giving a full derivation of the BP equations for 
analysing Eq. (|3]) goes beyond the scope (and space lim- 
itations) of this letter. We will give, however, some in- 
dications about the main steps: The factor graph corre- 
sponding to P^^)(7r) has N variable nodes i £ V, each 
one carrying a A^'-state spin variable tt^. There are two 
types of factor nodes: The first type corresponds to the 
M edges («,j) G E of graph G and measures the align- 
ment weight; the second type corresponds to each of the 
A^' possible image nodes i' G V and depends on the in- 
dicator (tt) if node i' is selected as an image or not. 
These factor nodes are a priori problematic, since each 
one is connected to all variable nodes i G V. Applying 
naively BP requires a priori a summation over 0{N'^) 
terms. However, the symmetry structure of the problem 
is the same as the one of (soft-constraint) affinity propa- 
gation [T3UT5] : Even if messages from a factor node i' to 
a variable node i are given as A^' dimensional vectors of 
the form Ai'^i^ni), they contain just two different entries 
for TTi = i' and tt^ ^ i' . The before- mentioned sums can 
be performed analytically, cf. [TSHlSj . Here we state only 
the final equations: 



A,. 
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Messages A and B are exchanged between the nodes of 
the two graphs, they have a nice intuitive interpretation: 
Bi^i' is a request of i to i'; it measures in how far i would 
like to select i' as his image. According to the second of 
Eqs. ©it depends on the node similarity Ci^ii as compared 
to the similarities Ciji of i to all other j' ^ i', and on 
the Q-messages containing the alignment weight for links 
£ G. Message Aii^i indicates the availability oii' to 
become image of i. It is large if the requests Bj^ii from 
other nodes j ^ i to i' are small, favoring therefore large 
Nt^. Messages Q and P are exchanged along the edges 
of G. After finding a fixed point of these equations, the 
marginal for i G V is given by 



oc e 



N- 
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i(7ri) 



(7) 

Being of approximate nature, it is not clear that these 
two MP strategies lead to the same results applied to the 
same graphs, but in practice we did not observe system- 
atic differences. The major difference is the time com- 
plexity: Whereas for relatively small G (as in the sub- 
graph isomorphism discussed below) the first strategy was 
found to be faster, the two applications concerning larger 
but sparse G (alignment of random graphs and protein- 
similarity graphs) are solved faster by the second strategy. 

The BP equations are fixed-point equations, and can be 
solved by iterating Eqs. (HI)-® resp. dU. This is not guar- 
anteed to converge, nor there is a guarantee that only one 
fixed point exists. Since in optimization we are interested 
in constructing one solution, this problem can be circum- 
vented by enforcing convergence through the so called soft 
decimation or reinforcement technique 16|. 

In the case of the first algorithm (Eqs. it amounts 
to multiplying at each step the right hand side of Eqs. (HJ 



and ([5]) by the term 



marginal probability, as computed in the previous step 
with the (modified) Eq. ([5]), to a scalar time dependent 
7t . For the simulations in this work, we used 74 = 1 — a* 
for a close to one. 

Note that BP bears some similarity with the IsoRank 
algorithm [4]. There are, however, some crucial differ- 
ences: (i) BP is derived from a variational minimization 
of cost function ([T]); (ii) IsoRank is a mean- field algorithm, 
whereas BP is based on the more precise Bethe approxi- 
mation; (iii) IsoRank has no explicit control over the in- 
jectivity of the resulting alignment. 

Finding maximum cliques — In the following, we are 
going to test BP on three sample problems. The setting 
of the parameters is /3 = 00 for version (i) and /3 = 10 and 
p = —40, —100 for version (ii) of the algorithm. 

The first application concerns finding the maximum 
clique in a given graph. On the computational side, this 
is indeed a root problem being both NP-complete [5] and 
difficult to approximate [5] • Within our previous notation, 
graph G now is a complete graph with M = N{N — l)/2, 
and we are trying to embed it into a second graph G' of 
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Fig. 1: In the left panel, we report the entropy of cliques of size 
4, as a function of the order A^' for various values of a. Symbols 
are BP results (each symbol averaged over 10 random graphs), 
full lines give the logarithm of the expected number of such 
cliques. On the right, we display the largest clique found by 
BP in graphs of size A*'' — 100, as a function of a (each symbol 
averaged over 10 random graphs). Results are consistent with 
theoretical bounds. 



normally much larger order N' . Node similarities Ci,i' are 
set to zero. In the specific case that G" is a random graph 
with TV' nodes and edge probability N''", a € (0,1), 
rigorous bounds for the maximum clique size cl{G') are 
known [7]: 



J log log N' 
\osN' 



< d{G') < 



J log log A^' 
logiV' 



(8) 



with 



fco(iV') = 



, log a 

'a log N' 



log- 



a log N' 



+ 1 + 0(1). (9) 



As usual in random-graph theory, these bounds hold with 
probability tending to one in the limit N' ^ 00 oi large 
target graphs G". Eq.l^lis derived from the expected num- 
ber of cliques of a given size, which has to be of order 
1. This expected number allows us also to estimate the 
number of smaller cliques in G', which can be directly 
compared to the Bethe entropy calculated by BP. 

Results of this comparison are presented in Fig. [TJ Fi- 
nite size effects set in for small entropy values, which cor- 
respond to the logarithm of the maximum cliques number, 
while for larger TV' the BP results perfectly coincide with 
the theoretical predictions. We note that in the case of 
cliques, and more in general for highly symmetric sub- 
graphs, it is possible to exploit symmetries to simplify the 
BP equations. The resulting equations have slightly dif- 
ferent convergence properties compared to the generic BP 
ones. The largest cliques size found by our algorithm are 
in agreement with the (fairly tight) rigorous bounds also 
for small value of N' = 100, see right panel of Fig. [T] In 
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the explicit constructions of cliques both the decimation 
and the reinforcement techniques have been used (the lat- 
ter one being substantially faster). 

Finding and couting small subgraphs is one of the ma- 
jor steps in the search for network motifs [T71 - [T5] . In con- 
trast to exhaustive algorithms [T71II1], BP is able to han- 
dle the problem even for medium size subgraphs. Furthei 
more, working at finite temperature allows for non-perfecl 
alignments, an thus for identifying, e.g., dense subgraphs 
instead of perfect cliques. A detailed exploration of this 
possibility goes beyond the scope of this letter. 

Aligning sparse random graphs — In the second prob- 
lem, we use BP to align two sparse random graphs G anc 
G' with identical numbers of nodes, N = N', and links 
M = M' . The injective mapping tt thus becomes a permu- 
tation. To study the best GA as a function of the inhereni 
similarity between the two graphs, we construct G and G' 
such that they have M — M^and links in common, the 
other Mrand are chosen independently in the two graphs 
[20] . Note that for Mrand = 0, the problem reduces to 
identify an isomorphism between the graphs. Due to the 
specific construction this isomorphism is trivially given by 
the identical permutation, ni — i, but this information is 
neither known nor exploitable by the algorithm; it serves 
only for a simple evaluation of the simulation results. For 
Mrand — M, the two graphs are independent, and the 
number of alignable links is monotonously decreasing with 

Mrand- 

Results for the alignment of G and G' without node sim- 
ilarities (ci.^; = 0) are given in the first panel of Fig. [2] 
For Mrand = 0, BP always identifies correctly the isomor- 
phism between the two graphs. In the other limiting case, 
Mrand = M the number of aligned links is much smaller, 
and depends on the graph realization and the initial con- 
dition of the BP messages. In between the two extremes 
we find a transition in the algorithmic behavior at some 
Mrand wherc the identical permutation has the same value 
of H as the best alignment of two independent graphs. 
Above Mrand, the number of aligned links is found to be 
almost constant, and equals the independent-graph case. 



For < Mr 



< Mr 



the BP solutions fall into two 



diflterent classes: one being close to the identical permu- 
tation with a high number of aligned links (green symbols 
in Fig. [5]), and one having "H- values coherent with the 
alignment of two independent graphs (red symbols) . The 
relative fraction of the first case is shown in the inset, it 
decreases when approaching Mrand- 

The behavior changes, when node similarities 
(cf. Eq. ([1])) are used to bias BP toward the identi- 
cal permutation. We set Ci^T^- = cSi^T^- for i = 1, if, and 
Ci,TTi = for i > if, with K G {0, A^}. The parameter 
K controls the number of biased nodes, and c controls the 
strength of this bias. Interestingly, already a low value 
of K is sufficient to let BP always find the lower energy 
solutions in the region < Mrand < Mrand- Results for 
different values of K and c are shown in the second panel 
of Fig [21 We observe that excessively high values of K 



or c decrease the performance of the algorithm for large 
Mrand by forciug it towards solutions of less aligned links 
but more self-aligned nodes. 
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Fig. 2: The number of aligned links as a function of the num- 
ber Mrand of independent links in G and G' , for N = N' = 80 
and M — M' = 140 (each data point averaged over 30 random- 
graph pairs and 5 BP runs). First panel: Results for alignment 
without node similarities {K = 0). For positive but not two 
large Mrand the solutions fall into two classes: Green sym- 
bols represent close-to-identical permutations of high number 
of aligned links, red symbols represent results which are close 
to the alignment of two independent graphs. The inset shows 
the fraction of BP runs leading to a close-to-identical permuta- 
tion. Second panel: Results for different values of the similar- 
ity parameters K and c, the best performance is obtained for 
K — 10 and c = 1 (small bias). The inset shows the number of 
self-aligned nodes, it drops when the bias is not too large and 

Mrand > Mrand- 

Finding interacting protein partners from multi-species 
sequence data — Despite the fundamental importance 
of protein-protein interactions in most biological pro- 
cesses, identifying interaction partners is experimentally 
and computationally a major problem. As a proof-of- 
concept application for GA we consider two signaling pro- 
teins, namely a histidine sensor kinase (SK) and a re- 
sponse regulator (RR). Their interaction forms the cen- 
tral part of two-component signal transduction (TCS), 
which is the most prominent signal transduction mecha- 
nism in bacteria. Each bacterium contains 0(10) interact- 
ing SK/RR pairs forming different TCS pathways; and the 
necessity to trigger the correct answer for each specific ex- 
tracellular signal forbids crosstalk between pathways. So, 
even if all different SK in one species are homologous (and 
therefore structurally and functionally similar), and the 
same is true for all RR, only specific samples of these two 
protein families interact. Our question here is, if GA can 
help to identify interaction partners. 

We start from a large collection of SK sequences ex- 
tracted from hundreds of bacterial genomes, and a second 
large collection of RR sequences coming from the same 
bacteria, and we aim at extracting interacting SK/RR 
pairs, exploiting sequence similarities of proteins inside 
each family. The basic idea is simple: Two SK with very 
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similar amino-acid sequences will (due to their probably 
recent common evolutionary origin) interact with two sim- 
ilar RR. Globally spoken, an alignment of two similarity 
networks - one for the SK family, one for the RR family - 
might be able to pair a large fraction of all those SK and 
RR which actually belong to common TCS pathways [12] . 
Our data set consists of two multiple-sequence alignments 
(with gaps) for 2546 SK and 2546 RR proteins from 231 
genomes pTI. They are selected such that, due to the fre- 
quent coding of an entire TCS in one operon, the correct 
mapping is known, and can be used a posteriori to verify 
our GA results. 

Similarity networks for each protein family are con- 
structed as fcNN graphs: Each protein is linked to the 
k most similar proteins, where similarity is measured via 
the Hamming distance dij between the aligned aminoacid 
sequences of two proteins i and j . The link weight is given 
as Wij = exp [— fifj/fifc], with dk being the average distance 
between each protein and its A:th neighbor. One might 
use more sophisticated distance measures (e.g. alignment 
scores), but due to the proof-of-concept character of this 
application we have chosen the simplest possible measure. 
To identify interaction partners, we must align only pro- 
teins inside the same species (formally implemented by 
Ci^ii — — oo for all i and i' belonging to different species). 
Finally, we have introduced various amounts of informa- 
tion about real interaction partners, by randomly intro- 
ducing positive similarities between a number of actual 
interaction partners (training set). The results are sum- 
marized in the following table for different k and training- 
set sizes. Error bars result from an average over different 
random training sets. The values display the fraction of 
correctly aligned protein pairs in between all proteins not 
being in the training set. 



training set 


3NN, yfc = 3 


6NN, fc = 6 


9NN, fc = 9 


2000 


88.7 ± 1.7 


89.8 ± 1.9 


90.5 ± 1.7 


1000 


76.2 ± 1.3 


78.7 ± 1.0 


79.6 ± 0.8 


500 


67.4 ± 1.9 


73.1 ± 1.4 


75.0 ± 1.0 





48.1 


58.9 


64.7 



We note that even without training set, almost 65% of 
all proteins are correctly matched (for A: = 9). This num- 
ber has to be compared to a random matching, where only 
231/2546 ~ 9% correct matchings would be expected. The 
introduction of a training set improves strongly the perfor- 
mance, for a training set of 2000 protein pairs, about 90% 
of the remaining 546 proteins are correctly aligned. These 
results beautifully demonstrate that the original idea to 
exploit sequence similarity of proteins across species is ac- 
tually providing information about who is interacting with 
whom. 

Conclusion — In this letter we have presented a dis- 
tributed (parallel) algorithm for graph-alignment prob- 
lems. The new technique is based on the cavity method 
and allows to deal efficiently with optimization problems 
under (global) topological constraints. The results on fa- 



mous problems such as sub-graph isomorphism and graph 
alignment on random graphs are in remarkable agreement 
with known rigorous bounds. Problems of this type are of- 
ten encountered in the analysis of large-scale data in many 
fields of science, computational biology in first place. 
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